Problems with Chinchilla Approach 2
Systematic biases in scaling law inference from IsoFLOP parabola fits
Motivation
Chinchilla Approach 2 is arguably the most widely adopted method for fitting scaling laws in practice today. Introduced in the original Chinchilla paper[1], it has since been used by leading AI labs including DeepMind[1],[7] (its creators), Meta[2],[9], DeepSeek[3], Microsoft[4], Amazon[6], Waymo[8], and Arc Institute[5], among others. It is also a workhorse method for academic scaling law studies[10],[11],[12] and high-profile practitioner tutorials from researchers like Andrej Karpathy.
The method's appeal lies in its stability and data efficiency relative to nonlinear optimization over all loss surface parameters. Rather than fitting every parameter of the loss surface simultaneously, Approach 2 relies on second-order Taylor approximations that reduce each IsoFLOP curve to a simple parabola. This lets practitioners estimate scaling exponents, the most actionable quantities for compute allocation planning, through a sequence of straightforward polynomial and linear fits, without ever touching a nonlinear optimizer.
To our knowledge, the sensitivity of these approximations and the method's behavior on loss surfaces that are less symmetric than the original Chinchilla form (where the parameter and token scaling exponents are roughly equal) have not been studied in detail. This article investigates that gap through synthetic simulations that eliminate all statistical noise, isolating the systematic biases inherent to the method itself.
We show how these biases affect downstream decisions like dataset size selection for final training runs at large compute budgets. We show how extrapolation errors trace back to suboptimal IsoFLOP experiment design, and that pathologies in these designs can be observed in real, high-profile scaling law studies even if they are difficult to quantify precisely. Finally, we propose an alternative fitting method that is simple, stable, and free of these biases while building on the same intuitive computational shortcut: optimizing exponential terms separately from linear terms.
Preliminaries — Loss Surface, Notation, and Fitting Methods
Neural scaling laws describe how model performance improves with compute. The Chinchilla loss surface models this relationship as:

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]
where \(N\) is the number of parameters, \(D\) is the number of training tokens, \(E\) is the irreducible loss, and \(A, B, \alpha, \beta\) capture how quickly performance improves with scale.
Given a compute budget \(C \approx 6ND\), the optimal allocation satisfies:

\[ N^*(C) \propto C^{a}, \qquad D^*(C) \propto C^{b}, \qquad a = \frac{\beta}{\alpha + \beta}, \qquad b = \frac{\alpha}{\alpha + \beta} \]
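As a concrete check, the allocation exponents follow directly from \(\alpha\) and \(\beta\). The helper below is our own sketch; the values plugged in are the Chinchilla and symmetric configurations used later in this article:

```python
# Compute-optimal allocation exponents from the surface exponents.
# Derivation: minimize E + A*N**-alpha + B*D**-beta subject to C = 6*N*D,
# which gives N* ∝ C**a and D* ∝ C**b.

def allocation_exponents(alpha: float, beta: float) -> tuple[float, float]:
    a = beta / (alpha + beta)   # parameter-count exponent
    b = alpha / (alpha + beta)  # token-count exponent
    return a, b

print(allocation_exponents(0.34, 0.28))  # Chinchilla: (0.4516..., 0.5483...)
print(allocation_exponents(0.31, 0.31))  # symmetric: (0.5, 0.5)
```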
Recovering the exponents \(a\) and \(b\) from empirical training runs is crucial for planning efficient large-scale training. Two canonical approaches from the Chinchilla paper are relevant here (we keep the paper's numbering; its Approach 1 is not considered):
Approach 2: IsoFLOP Parabolic Fitting
This method is presented in the Chinchilla paper. The key insight is that along a fixed-compute contour (IsoFLOP curve), loss as a function of \(\log N\) is approximately parabolic near the optimum.
- Sample IsoFLOP contours: For each compute budget \(C\), train models at various \((N, D)\) pairs satisfying \(C = 6ND\)
- Fit parabolas: For each budget, fit \(L = p_2(\log N)^2 + p_1(\log N) + p_0\) and extract the optimum \(N^*\) from the parabola's vertex, \(\log N^* = -p_1 / (2 p_2)\)
- Fit power laws: Regress \(\log N^*\) against \(\log C\) to recover the exponent \(a\) (and similarly for \(D^*\), \(b\))
The appeal is simplicity: only polynomial fits, no nonlinear optimization. The parabolic approximation comes from a Taylor expansion of the loss surface around the optimum.
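To make the procedure concrete, here is a minimal sketch of steps 1 and 2 on a synthetic surface. The constants \(E\), \(A\), \(B\) are the Chinchilla paper's fitted values; the ±0.5-decade grid and the 10¹⁹ FLOP budget are illustrative choices of ours:

```python
import numpy as np

# One Approach 2 step: sample a single IsoFLOP contour of a synthetic
# Chinchilla surface and recover N* from a parabola fit in log10 N.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

C = 1e19                                          # one compute budget
N_true = (alpha * A / (beta * B))**(1 / (alpha + beta)) * (C / 6)**(beta / (alpha + beta))

u = np.linspace(-0.5, 0.5, 10)                    # log10 grid around the optimum
N = N_true * 10.0**u
L = E + A * N**-alpha + B * (C / (6 * N))**-beta  # D is pinned by C = 6*N*D

p2, p1, _ = np.polyfit(u, L, 2)                   # L ≈ p2*u^2 + p1*u + p0
N_star = N_true * 10.0**(-p1 / (2 * p2))          # parabola vertex
print(N_star / N_true)                            # slightly above 1: not exact
```

Note that even on noise-free data the recovered \(N^*\) is not exactly \(N_\text{true}\); this small mismatch is the subject of the later sections.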
Approach 3: Direct Surface Fitting
The alternative is to fit all five parameters \((E, A, B, \alpha, \beta)\) simultaneously via nonlinear least squares. This avoids the parabolic approximation entirely but is notoriously unstable: highly sensitive to initialization and prone to converging to spurious local minima.
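For contrast, here is a sketch of Approach 3 on noise-free synthetic data using `scipy.optimize.curve_fit`. The sampling ranges, initial guess, and bounds are illustrative assumptions; on clean synthetic data the fit converges easily, which real, noisy data does not guarantee:

```python
import numpy as np
from scipy.optimize import curve_fit

# Direct surface fitting: recover (E, A, B, alpha, beta) jointly from
# synthetic, noise-free loss observations.

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A * N**-alpha + B * D**-beta

rng = np.random.default_rng(0)
N = 10.0**rng.uniform(7, 10, size=200)    # 10M to 10B parameters
D = 10.0**rng.uniform(9, 12, size=200)    # 1B to 1T tokens
L = chinchilla_loss((N, D), 1.69, 406.4, 410.7, 0.34, 0.28)

popt, _ = curve_fit(
    chinchilla_loss, (N, D), L,
    p0=[1.5, 300.0, 300.0, 0.32, 0.30],              # illustrative guess
    bounds=([0, 0, 0, 0, 0], [10, 1e4, 1e4, 1, 1]),  # keep parameters positive
)
```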
The Happy Path — Symmetric Surfaces
Before examining failure modes, let's establish that Approach 2 works perfectly under ideal conditions. Consider a symmetric loss surface where \(\alpha = \beta\) (here \(\alpha = \beta = 0.31\) with \(A = B\)):

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{A}{D^{\alpha}} \]

With equal exponents, the optimal allocation splits compute evenly between parameters and data. The true scaling exponents are:

\[ a = b = \frac{1}{2} \]
We sample five IsoFLOP contours spanning \(10^{17}\) to \(10^{21}\) FLOPs, fit parabolas to each, and extract the optimal token count \(D^*\).
The results confirm perfect recovery of the token scaling exponent and intercept:
| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.500000 | 0.500000 | +6.2×10⁻¹²% |
| b₀ (D* intercept) | −0.389076 | −0.389076 | −1.4×10⁻¹⁰% |
On a symmetric loss surface with perfectly crafted IsoFLOP grid sampling, Approach 2 recovers both exponents and intercepts with machine-precision accuracy. The parabolic approximation is exact when \(\alpha = \beta\).
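The happy-path simulation is short enough to sketch end to end. The magnitudes of \(E\) and \(A = B\) below are placeholder assumptions; with \(\alpha = \beta\) and \(A = B\), symmetry alone makes the recovery exact:

```python
import numpy as np

# End-to-end Approach 2 on the symmetric surface (alpha = beta = 0.31).
# E and A = B are placeholder magnitudes; only their symmetry matters.
E, A, B, alpha, beta = 1.8, 400.0, 400.0, 0.31, 0.31

def fit_logDstar(C, half_width=1.205, n=10):
    """Parabola-fit one IsoFLOP contour; return the inferred log10 D*."""
    D_opt = (C / 6)**0.5                             # true optimum (A = B)
    u = np.linspace(-half_width, half_width, n)      # centered log10 grid
    D = D_opt * 10.0**u
    L = E + A * (6 * D / C)**alpha + B * D**-beta    # N is pinned by C = 6*N*D
    p2, p1, _ = np.polyfit(u, L, 2)
    return np.log10(D_opt) - p1 / (2 * p2)           # vertex in absolute coords

budgets = np.logspace(17, 21, 5)
b, b0 = np.polyfit(np.log10(budgets), [fit_logDstar(C) for C in budgets], 1)
print(b, b0)   # 0.5 and -0.3891, both to near machine precision
```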
This establishes our baseline: Approach 2 is exactly correct under ideal conditions that are unrealistic in practice. The following sections perturb those conditions in controlled ways and show where the method breaks down.
Asymmetric Surfaces — Intercept and Extrapolation Errors
We repeat the exact same procedure as before: perfect sampling centers, no noise, identical methodology. The only change is that the loss surface is now asymmetric (\(\alpha \neq \beta\)).
What Happens
Simulation results show that when the loss surface is asymmetric, Approach 2 produces systematically wrong intercepts while exponents remain accurate. This isn't statistical noise; it's a deterministic bias from fitting parabolas to a non-parabolic surface.
We test two configurations to see how the effect scales:
- Chinchilla: \(\alpha = 0.34\), \(\beta = 0.28\) (ratio ≈ 1.2)
- High Imbalance: \(\alpha = 0.465\), \(\beta = 0.155\) (ratio = 3.0)
Chinchilla Surface
| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.548387 | 0.548387 | ≈ 0% |
| b₀ (D* intercept) | −0.555357 | −0.578092 | −4.1% |
High Imbalance Surface
| Parameter | True Value | Inferred Value | Relative Error |
|---|---|---|---|
| b (D* exponent) | 0.750000 | 0.750000 | ≈ 0% |
| b₀ (D* intercept) | −1.345791 | −1.459957 | −8.5% |
Why This Is Surprising
A few percent error in the intercept might seem minor, but consider that this simulation gave Approach 2 every advantage. The data is perfect: no measurement noise, with every point lying exactly on the true loss surface. The sampling is perfect too, with IsoFLOP grids centered precisely at the true optimum (something you wouldn't know how to do in practice). And the parameters are standard, taken directly from the Chinchilla paper rather than contrived to expose a potentially unrealistic weakness.
Even under these ideal conditions, Approach 2 produces biased intercepts for asymmetric surfaces. The error is systematic, a property of the parabolic approximation, not statistical noise.
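This bias is easy to reproduce. The sketch below uses the Chinchilla paper's fitted surface values with the XL grid (\(W = 2.41\), \(n = 10\)); the fitted slope matches the true exponent while the intercept picks up a constant shift:

```python
import numpy as np

# Reproduce the asymmetry bias with the Chinchilla paper's fitted surface
# values and the XL grid (W = 2.41 decades, n = 10 points per contour).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
G = (alpha * A / (beta * B))**(1 / (alpha + beta))   # allocation constant

def fit_logDstar(C, W=2.41, n=10):
    """Parabola-fit one IsoFLOP contour in log10 D; return inferred log10 D*."""
    D_opt = (C / 6)**(alpha / (alpha + beta)) / G    # true optimum
    u = np.linspace(-W / 2, W / 2, n)                # centered log10 grid
    D = D_opt * 10.0**u
    L = E + A * (6 * D / C)**alpha + B * D**-beta    # N pinned by C = 6*N*D
    p2, p1, _ = np.polyfit(u, L, 2)
    return np.log10(D_opt) - p1 / (2 * p2)

budgets = np.logspace(17, 21, 5)
b, b0 = np.polyfit(np.log10(budgets), [fit_logDstar(C) for C in budgets], 1)

b_true = alpha / (alpha + beta)                   # 0.548387...
b0_true = -b_true * np.log10(6) - np.log10(G)     # -0.555357...
print(b - b_true)    # effectively zero: the exponent survives
print(b0 - b0_true)  # ~ -0.023: the intercept absorbs the vertex shift
```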
Why It Happens
The IsoFLOP loss curve is not a true parabola; it contains exponential terms. When a parabola is fit to this curve, the parabola's minimum (vertex) doesn't land exactly at the true optimum. It shifts slightly, and the key insight is that this shift depends only on the loss surface shape (\(\alpha\), \(\beta\)) and the sampling grid. It does not depend on compute budget. The sampling grid size becomes important here: wider grids amplify the mismatch between the true curve and its parabolic approximation, increasing the vertex shift.
Since the vertex shift is constant across all compute budgets, it biases every inferred \(N^*\) by the same multiplicative factor. When fitting \(\log N^*\) vs \(\log C\) to extract scaling exponents:
- The slope (exponent) is unchanged: multiplying all \(N^*\) values by a constant factor adds a constant to \(\log N^*\), which doesn't affect the slope
- The intercept absorbs the entire error, biased by exactly that multiplicative factor
Exact derivation: The intercept error can be derived analytically in closed form. The parabola vertex shifts by \(\delta w\) (in log-space), giving an intercept error of:

\[ \hat{b}_0 - b_0 = \delta w \]
where \(\delta w = f(\alpha, \beta, W, n)\) depends only on the surface exponents and the sampling grid (width \(W\) in log-space, number of points \(n\) per IsoFLOP curve), not on \(C\), \(E\), \(A\), or \(B\). A grid of width \(W\) spans from \(10^{-W/2}\) to \(10^{W/2}\) times the optimal \(N^*\), so \(W = 2.41\) (the XL grid) means sampling from \(\frac{1}{16}\times\) to \(16\times\) the optimum, and \(n = 10\) means 10 model sizes per compute budget. Key properties:
- \(\delta w = 0\) when \(\alpha = \beta\) (symmetric surfaces have no error)
- \(\delta w\) grows with \(|\alpha - \beta|\) (more asymmetry → more error)
- \(\delta w\) grows with \(W\) (wider sampling range → more error)
For example, with the Chinchilla parameters (\(\alpha = 0.34\), \(\beta = 0.28\)): the XS grid (\(W = 0.60\)) yields 0.3% intercept error, while the XL grid (\(W = 2.41\)) yields 4.1% error.
The full derivation provides the closed-form expression for vertex shift \(\delta w\) as a function of \(\alpha\), \(\beta\), \(W\), and \(n\). It also shows how this shift translates directly into intercept error, independent of compute budget.
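As a numerical stand-in for that closed form, we can fit the parabola directly to the budget-independent contour shape. The coefficients \(\beta\) and \(\alpha\) below encode the optimality condition \(\alpha P = \beta Q\), and the function name is ours:

```python
import numpy as np

# Vertex shift delta_w(alpha, beta, W, n), computed numerically: fit a
# parabola to the normalized IsoFLOP contour shape in log10 D and return
# the vertex offset (in decades) from the true optimum at u = 0.

def delta_w(alpha, beta, W, n):
    u = np.linspace(-W / 2, W / 2, n)
    g = beta * 10.0**(alpha * u) + alpha * 10.0**(-beta * u)
    p2, p1, _ = np.polyfit(u, g, 2)
    return -p1 / (2 * p2)

print(delta_w(0.31, 0.31, 2.41, 10))   # 0: no shift on symmetric surfaces
print(delta_w(0.34, 0.28, 0.60, 10))   # small negative shift (XS grid)
print(delta_w(0.34, 0.28, 2.41, 10))   # ~ -0.023 (XL grid)
```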
Intuition via Taylor expansion: A parabola is a 2nd-order polynomial, which is equivalent to a 2nd-order Taylor expansion around the optimum. The approximation \(L(w) \approx L(0) + \frac{1}{2}L''(0)w^2\) is only valid when higher-order terms are negligible, i.e., when samples are close to the true minimum. As sampling range increases, 3rd and 4th order terms grow. For symmetric surfaces (\(\alpha = \beta\)), odd-order terms cancel by symmetry, preserving the vertex location. For asymmetric surfaces, they don't cancel, shifting the fitted vertex away from the true optimum.
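This cancellation can be checked in one line. Writing the contour shape as \(g(u) = \beta \, 10^{\alpha u} + \alpha \, 10^{-\beta u}\) (coefficients fixed by the optimality condition, a parameterization of ours), the third derivative at the optimum is proportional to \(\alpha\beta(\alpha^2 - \beta^2)\), which vanishes exactly when \(\alpha = \beta\):

```python
import math

# Third derivative of g(u) = beta*10^(alpha*u) + alpha*10^(-beta*u) at the
# optimum u = 0; proportional to alpha*beta*(alpha^2 - beta^2).

def g_third_derivative(alpha, beta):
    ln10 = math.log(10)
    return (beta * alpha**3 - alpha * beta**3) * ln10**3

print(g_third_derivative(0.31, 0.31))  # 0.0: the odd term cancels exactly
print(g_third_derivative(0.34, 0.28))  # nonzero: the fitted vertex shifts
```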
Why It Matters
Extrapolation to higher compute budgets requires both exponents and intercepts to be correct. The previous section established that asymmetric loss surfaces produce provably biased intercepts even under ideal experimental conditions. Here we quantify what those errors mean in practical terms by examining compute-optimal token prediction: given a compute budget, how many tokens does the inferred scaling law predict?
Up to this point, all analysis has assumed a single fixed sampling grid width. We now examine how token prediction error varies with both compute budget and sampling grid width. For surfaces with asymmetric exponents, wider sampling grids amplify the parabola-fitting mismatch, increasing the constant vertex shift and thus the intercept bias. To make this comparison concrete, we first define what "wider" and "narrower" mean in quantitative terms.
A sampling grid of "±k×" means the sampled values (whether model sizes or token counts) range from 1⁄k to k times the true optimum at each compute budget. The total range covered is k² (the ratio of largest to smallest sample), and the log₁₀ of that ratio tells you how many factors of 10, or "decades," the grid spans end-to-end (e.g. a value of 1.81 means the largest sample is 10¹·⁸¹ ≈ 64× the smallest). The table below shows the four grid widths used in this analysis:
| Grid Name | ±kx | Sampling Range | Total Ratio | Decade Span (factors of 10) |
|---|---|---|---|---|
| Extra Small (XS) | ±2x | 1/2x to 2x | 4x | 0.60 |
| Small (S) | ±4x | 1/4x to 4x | 16x | 1.20 |
| Large (L) | ±8x | 1/8x to 8x | 64x | 1.81 |
| Extra Large (XL) | ±16x | 1/16x to 16x | 256x | 2.41 |
In practice, scaling law experiments typically sample across 1 to 2 decades in token count, placing the Small and Large grids squarely within the realistic range. The Extra Small and Extra Large grids bracket this range on either side, illustrating how the biases shrink or grow as the sampling window narrows or widens. The Extra Large grid (±16x, ~2.4 decades) is the default used in all single-grid analyses in the preceding sections.
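The decade-span column is plain arithmetic on \(k\):

```python
import math

# A ±k grid spans k^2 end-to-end, i.e. 2*log10(k) decades.

def decade_span(k: float) -> float:
    return math.log10(k**2)

for k in (2, 4, 8, 16):
    print(k, round(decade_span(k), 2))   # 0.6, 1.2, 1.81, 2.41
```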
📊 View raw data
| Surface | α | β | Grid | True D* | Inferred D* | Abs Error | Rel Error |
|---|---|---|---|---|---|---|---|
| Symmetric Surface (α = β) ||||||||
| Symmetric | 0.31 | 0.31 | XS (±2×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | S (±4×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | L (±8×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Symmetric | 0.31 | 0.31 | XL (±16×) | 408.2B | 408.2B | ≈0 | ≈0% |
| Chinchilla Surface (α ≠ β) ||||||||
| Chinchilla | 0.34 | 0.28 | XS (±2×) | 4.04T | 4.02T | −13.2B | −0.33% |
| Chinchilla | 0.34 | 0.28 | S (±4×) | 4.04T | 3.98T | −52.5B | −1.30% |
| Chinchilla | 0.34 | 0.28 | L (±8×) | 4.04T | 3.92T | −117.2B | −2.90% |
| Chinchilla | 0.34 | 0.28 | XL (±16×) | 4.04T | 3.83T | −205.8B | −5.10% |
| High Imbalance Surface (α/β = 3) ||||||||
| High Imbalance | 0.465 | 0.155 | XS (±2×) | 45.1Q | 44.3Q | −755.4T | −1.67% |
| High Imbalance | 0.465 | 0.155 | S (±4×) | 45.1Q | 42.2Q | −2.9Q | −6.50% |
| High Imbalance | 0.465 | 0.155 | L (±8×) | 45.1Q | 38.8Q | −6.3Q | −13.91% |
| High Imbalance | 0.465 | 0.155 | XL (±16×) | 45.1Q | 34.7Q | −10.4Q | −23.12% |
B = billion, T = trillion, Q = quadrillion. Training range: 10¹⁷–10²¹ FLOPs. Evaluation budget: 10²⁴ FLOPs.
The key observations from this figure are:
- Symmetric surfaces are unaffected: When \(\alpha = \beta\), all grid widths produce zero error
- Asymmetric surfaces underestimate: Negative errors mean the inferred \(D^*\) is smaller than the true \(D^*\). Following these predictions would undertrain the model
- Wider grids amplify error: Moving from XS (±2x) to XL (±16x) grids increases error from 0.3% to 5.1% on Chinchilla, and from 1.7% to 23% on High Imbalance
- Asymmetry magnifies everything: The High Imbalance surface (\(\alpha/\beta = 3\)) shows roughly 4–5x larger errors than Chinchilla at each grid width
Consider the Chinchilla surface with the Large grid (±8x), a practical sampling range for real experiments. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 4.04 trillion, but Approach 2 predicts only 3.92 trillion: a 2.9% underestimate, or roughly 117 billion fewer tokens than optimal. While 2.9% may seem modest, recall that this simulation uses unrealistically ideal conditions: perfectly centered sampling grids at every compute budget and zero measurement noise. Real experiments, where the true optimum is unknown, data is noisy, and the scaling exponent imbalance may be larger than Chinchilla's modest \(\alpha/\beta \approx 1.2\), can only do worse.
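The headline numbers can be reproduced in a few lines. The sketch below uses the Chinchilla paper's fitted surface values, the Large grid (\(W = 1.81\), \(n = 10\)), and training budgets \(10^{17}\)–\(10^{21}\) FLOPs, then extrapolates to \(10^{24}\):

```python
import numpy as np

# Fit on 1e17-1e21 FLOPs with the Large grid, extrapolate to 1e24 FLOPs.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
G = (alpha * A / (beta * B))**(1 / (alpha + beta))

def fit_logDstar(C, W=1.81, n=10):
    D_opt = (C / 6)**(alpha / (alpha + beta)) / G   # true optimum
    u = np.linspace(-W / 2, W / 2, n)
    D = D_opt * 10.0**u
    L = E + A * (6 * D / C)**alpha + B * D**-beta   # N pinned by C = 6*N*D
    p2, p1, _ = np.polyfit(u, L, 2)
    return np.log10(D_opt) - p1 / (2 * p2)

budgets = np.logspace(17, 21, 5)
b, b0 = np.polyfit(np.log10(budgets), [fit_logDstar(C) for C in budgets], 1)

D_true = (1e24 / 6)**(alpha / (alpha + beta)) / G
D_pred = 10.0**(b * 24 + b0)
print(D_true / 1e12, D_pred / 1e12)   # ≈ 4.04 vs ≈ 3.92 trillion tokens
```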
Off-Center Sampling — Exponent and Extrapolation Errors
The previous sections assumed perfectly centered sampling: at every compute budget, the IsoFLOP grid was placed exactly at the true optimum. In practice, you don't know \(N^*\) before running the experiment. Sampling centers are guesses, informed by prior estimates or heuristics, and they will inevitably be wrong by some amount.
This is a distinct source of error from the asymmetry bias examined earlier. Asymmetry errors arise from the shape of the loss surface (\(\alpha \neq \beta\)); off-center errors arise from where you place the sampling grid. To isolate this new effect, we return to the symmetric surface (\(\alpha = \beta = 0.31\)) where asymmetry bias is zero by construction.
Constant Multiplicative Bias
The simplest form of off-center sampling is a constant multiplicative offset: every compute budget's sampling center is shifted by the same factor from the true optimum. A "3× offset" means each IsoFLOP grid is centered at \(3 \times D^*\) instead of \(D^*\), so the grid midpoint consistently sits at three times the true optimal token count.
Because this offset is the same at every compute budget, it has a familiar geometric effect: each parabola vertex shifts by a constant amount in log-space. This is the same mechanism as asymmetry bias. The slope of \(\log D^*\) vs \(\log C\) is unaffected (a constant additive shift in log-space doesn't change the slope), so the scaling exponent is preserved perfectly. The intercept, however, absorbs the entire error.
The extrapolation bar chart (top right) shows what this means for token prediction: all four grid widths overestimate \(D^*\), with the narrowest grid (XS) producing the largest error. This is the reverse of the asymmetry bias pattern, where wider grids amplified error. Here, narrower grids are more sensitive to off-center placement because fewer samples lie near the true optimum.
The intercept error panel (bottom right) confirms the pattern across the full continuum of grid widths. The error is always positive (the inferred \(D^*\) overshoots) and decreases monotonically as the grid widens, reflecting how a wider sampling range brings more of the true loss curve's shape into the fit, partially compensating for the misplaced center.
Consider the symmetric surface with the Large grid (±8×) and a 3× offset, where every IsoFLOP grid is centered at three times the true optimal token count. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 408.2 billion, but Approach 2 predicts 419.0 billion: a 2.6% overestimate, roughly 10.8 billion more tokens than optimal. Compare this with the Chinchilla asymmetry result at the same grid width: a 2.9% underestimate. The magnitudes are comparable, but the sources are entirely different. Asymmetry bias comes from the shape of the loss surface; off-center bias comes from where you place the grid. In a real experiment, both act simultaneously.
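The constant-offset experiment can be sketched the same way (symmetric surface with placeholder \(E\) and \(A = B\) magnitudes, 3× offset, Large grid):

```python
import numpy as np

# Constant off-center sampling on the symmetric surface: every grid is
# centered at 3x the true optimal token count (Large grid, W = 1.81).
E, A, B, alpha, beta = 1.8, 400.0, 400.0, 0.31, 0.31

def fit_logDstar(C, offset=3.0, W=1.81, n=10):
    D_opt = (C / 6)**0.5                  # true optimum (alpha = beta, A = B)
    center = np.log10(offset * D_opt)     # misplaced grid center
    u = np.linspace(-W / 2, W / 2, n)
    D = 10.0**(center + u)
    L = E + A * (6 * D / C)**alpha + B * D**-beta
    p2, p1, _ = np.polyfit(u, L, 2)
    return center - p1 / (2 * p2)

budgets = np.logspace(17, 21, 5)
b, b0 = np.polyfit(np.log10(budgets), [fit_logDstar(C) for C in budgets], 1)
print(b)                                  # still 0.5: the slope is preserved

D_true = (1e24 / 6)**0.5
D_pred = 10.0**(b * 24 + b0)
print(D_pred / D_true)                    # ≈ 1.026: the ~2.6% overestimate
```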
Drifting Bias
When the offset grows with compute budget (e.g., prior estimates become progressively worse at higher compute), both exponents and intercepts are corrupted. This is qualitatively different from constant bias and represents a more severe failure mode.
Constant bias preserves exponents; any compute-dependent bias pattern distorts them. The distinction matters because exponent errors compound during extrapolation, while intercept errors remain fixed.
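To make the distinction concrete, the sketch below reruns the symmetric-surface experiment with an offset that drifts from 1× to 5× across the training budgets (an illustrative schedule of our own):

```python
import numpy as np

# Drifting off-center sampling: the grid center drifts from 1x to 5x the
# true optimum as compute grows. Placeholder E and A = B magnitudes.
E, A, B, alpha, beta = 1.8, 400.0, 400.0, 0.31, 0.31

def fit_logDstar(C, offset, W=1.81, n=10):
    D_opt = (C / 6)**0.5
    center = np.log10(offset * D_opt)
    u = np.linspace(-W / 2, W / 2, n)
    D = 10.0**(center + u)
    L = E + A * (6 * D / C)**alpha + B * D**-beta
    p2, p1, _ = np.polyfit(u, L, 2)
    return center - p1 / (2 * p2)

budgets = np.logspace(17, 21, 5)
offsets = np.logspace(0, np.log10(5), 5)       # 1x at 1e17 -> 5x at 1e21
logD = [fit_logDstar(C, o) for C, o in zip(budgets, offsets)]
b, _ = np.polyfit(np.log10(budgets), logD, 1)
print(b)   # noticeably above 0.5: the drift corrupts the exponent itself
```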
References
1. "Training Compute-Optimal Large Language Models." arXiv. https://arxiv.org/abs/2203.15556
2. "The Llama 3 Herd of Models." arXiv. https://arxiv.org/abs/2407.21783
3. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism." arXiv. https://arxiv.org/abs/2401.02954
4. "Exploring Scaling Laws for EHR Foundation Models." arXiv. https://arxiv.org/abs/2505.22964
5. "Sequence modeling and design from molecular to genome scale with Evo." bioRxiv. https://www.biorxiv.org/content/10.1101/2024.02.27.582234v2
6. "Scaling Laws for Imitation Learning in Single-Agent Games." TMLR. https://arxiv.org/abs/2307.09423
7. "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design." NeurIPS. https://arxiv.org/abs/2305.13035
8. "Scaling Laws of Motion Forecasting and Planning -- Technical Report." arXiv. https://arxiv.org/abs/2506.08228
9. "Training compute-optimal transformer encoder models." EMNLP. https://aclanthology.org/2025.emnlp-main.1804.pdf
10. "Scaling Laws For Diffusion Transformers." arXiv. https://arxiv.org/abs/2410.08184
11. "Scaling Behavior of Discrete Diffusion Language Models." arXiv. https://arxiv.org/abs/2512.10858
12. "Scaling Laws for Compute Optimal Biosignal Transformers." University of Waterloo. https://dspacemainprd01.lib.uwaterloo.ca/server/api/core/bitstreams/b66b1078-b359-4688-8dac-45e78806eb3d/content